1. INTRODUCTION

The data science industry, at the intersection of statistics, computer science, and business analysis, has rapidly grown into a critical field for data-driven decision-making and innovation. It plays a pivotal role in various sectors, including technology, finance, healthcare, and retail, among others. The industry’s growth is fueled by the increasing generation of data and the need for sophisticated tools and methodologies to extract insights and inform business strategies. Data scientists are highly sought after for their ability to analyze complex datasets, create predictive models, and communicate findings effectively. The evolving nature of the industry, marked by advancements in machine learning, artificial intelligence, and big data technologies, continues to expand the scope and impact of data science roles, making it a dynamic and future-focused career field.

Understanding salary levels in data science is valuable both for individuals planning their careers and for organizations setting compensation. The research question of this analysis is: What factors contribute to the top quartile of data science salaries? The salary_in_usd variable is used as the main variable to ensure standardization and comparability of data across countries.

This report consists of several parts, including data evaluation, exploratory data analysis, clustering and advanced analysis.

2. DATA EVALUATION

2.1. SOURCE:

The data used for this analysis was obtained from Kaggle, a platform for data scientists (link: https://www.kaggle.com/datasets/arnabchaki/data-science-salaries-2023/data), in the form of a .CSV file.

2.2. MISSING VALUES DETECTION:

There are no missing values in the dataset.

## [1] 0

2.3. DESCRIPTIVE STATISTICAL ANALYSIS OF DATASET:

Statistical Summary of Salaries in USD
Statistic                    Value
Standard Deviation           63055.63 USD
Variance                     3976011879.23 USD squared
Interquartile Range (IQR)    80000.00 USD
Lower Bound for Outliers     -25000.00 USD
Upper Bound for Outliers     295000.00 USD

Interpretation of the values: The high variance confirms that there is a substantial spread in the salary data, indicating diverse salary ranges within the field. An IQR of 80,000 USD means that the central half of the salaries spans a wide range, emphasizing the diversity in compensation across data science positions. The lower bound for outliers is -25,000 USD. Negative salaries are not feasible, and the lowest salary in the dataset is 5,132 USD, so there are no low outliers in the salary data. The upper bound for outliers shows that salaries above 295,000 USD are outliers, signifying extremely high-paying roles in the data science industry. These might be associated with highly specialized skills, leadership roles, or specific high-paying industries or regions.

In conclusion, the data shows a broad range of salaries within the field of data science, indicating a diverse industry with varying levels of compensation. This range may be attributable to factors such as geographical location, level of education, experience, and specific job roles. The presence of high-paying outliers suggests opportunities for significantly lucrative roles in the industry. Understanding these salary dynamics is crucial for both professionals navigating their career paths and organizations structuring their compensation strategies.
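The reported IQR and outlier bounds are consistent with the common 1.5 × IQR rule applied to the quartiles of salary_in_usd (95,000 and 175,000 USD, as listed in the dataset summary in section 3.1.1). The following is a minimal sketch under that assumption; the function name and the multiplier parameter k are illustrative and not taken from the original analysis.

```python
# Quartiles of salary_in_usd as reported in the dataset summary (section 3.1.1).
Q1 = 95_000
Q3 = 175_000


def tukey_fences(q1, q3, k=1.5):
    """Return (iqr, lower_bound, upper_bound) using the k * IQR rule.

    k = 1.5 is an assumption; it reproduces the bounds stated in the report.
    """
    iqr = q3 - q1
    return iqr, q1 - k * iqr, q3 + k * iqr


iqr, lower, upper = tukey_fences(Q1, Q3)
print(f"IQR = {iqr}, lower bound = {lower}, upper bound = {upper}")
```

Any salary above the upper bound would be flagged as a high outlier. The lower bound is negative, so no observed salary can fall below it.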

2.4. BOXPLOT FOR THE DISTRIBUTION AND OUTLIERS IN SALARY IN USD:

The boxplot below illustrates the statistics discussed above. The presence of outliers, indicated by points beyond the “whiskers” of the boxplot, suggests significant variation in the upper range of salaries. These outliers could represent highly specialized roles, exceptionally experienced individuals, or specific industries within data science where salaries are markedly higher.

2.5. DETECTION OF INCONSISTENCIES IN CATEGORICAL DATA:

The frequency tables below show several interesting trends:

  1. The most frequent experience level is Senior (SE).
  2. Most employees work full time (FT).
  3. Medium-sized companies (M) are prevalent.
  4. The most frequent job titles are Data Engineer, Data Scientist and Data Analyst, followed by Machine Learning Engineer, Analytics Engineer and Data Architect. Many job titles occur only once or twice.
  5. Most records concern employees residing in the United States of America.
  6. Unsurprisingly, most companies are also located in the United States of America.

Such large differences in frequency make further analysis difficult. For this reason, job titles and employee residence values will be grouped into career fields and regions, respectively.

Frequency of Experience Level
Value Frequency
SE 2516
MI 805
EN 320
EX 114

Frequency of Employment Type
Value Frequency
FT 3718
PT 17
CT 10
FL 10

Frequency of Company Size
Value Frequency
M 3153
L 454
S 148

The tables below summarize how often each job title, employee residence, salary currency and company location occurs. In each table, Frequency is the number of records, Value lists the entries that occur with that frequency, and Count is the number of such entries. Clearly, some job titles are prevalent, but, interestingly, many occur only once. This may be due to different phrasing, meaning that the role is the same but the job title is different. On the other hand, some unique job titles may reflect highly specialized and innovative roles.

Summary of Job Titles
Frequency Value Count
1040 Data Engineer 1
840 Data Scientist 1
612 Data Analyst 1
289 Machine Learning Engineer 1
103 Analytics Engineer 1
101 Data Architect 1
82 Research Scientist 1
58 Applied Scientist, Data Science Manager 2
37 Research Engineer 1
34 ML Engineer 1
29 Data Manager 1
26 Machine Learning Scientist 1
24 Data Science Consultant 1
22 Data Analytics Manager 1
18 Computer Vision Engineer 1
16 AI Scientist 1
15 BI Data Analyst, Business Data Analyst 2
14 Data Specialist 1
13 BI Developer 1
12 Applied Machine Learning Scientist 1
11 AI Developer, Big Data Engineer, Director of Data Science, Machine Learning Infrastructure Engineer 4
10 Applied Data Scientist, Data Operations Engineer, ETL Developer, Head of Data, Machine Learning Software Engineer 5
9 BI Analyst, Head of Data Science, Lead Data Scientist 3
8 Data Science Lead, Principal Data Scientist 2
7 Data Quality Analyst, Machine Learning Developer, NLP Engineer 3
6 Data Analytics Engineer, Data Infrastructure Engineer, Deep Learning Engineer, Lead Data Engineer, Machine Learning Researcher 5
5 Cloud Database Engineer, Computer Vision Software Engineer, Data Science Engineer, Lead Data Analyst, Product Data Analyst 5
4 3D Computer Vision Researcher, Business Intelligence Engineer, Data Operations Analyst, Machine Learning Research Engineer, MLOps Engineer 5
3 Cloud Data Engineer, Financial Data Analyst, Lead Machine Learning Engineer, Machine Learning Manager 4
2 AI Programmer, Applied Machine Learning Engineer, Autonomous Vehicle Technician, Big Data Architect, Data Analytics Consultant, Data Analytics Lead, Data Analytics Specialist, Data Lead, Data Modeler, Data Scientist Lead, Data Strategist, ETL Engineer, Insight Analyst, Marketing Data Analyst, Principal Data Analyst, Principal Data Engineer, Software Data Engineer 17
1 Azure Data Engineer, BI Data Engineer, Cloud Data Architect, Compliance Data Analyst, Data DevOps Engineer, Data Management Specialist, Data Science Tech Lead, Deep Learning Researcher, Finance Data Analyst, Head of Machine Learning, Manager Data Management, Marketing Data Engineer, Power BI Developer, Principal Data Architect, Principal Machine Learning Engineer, Product Data Scientist, Staff Data Analyst, Staff Data Scientist 18
Summary of Employee Residence
Frequency Value Count
3004 US 1
167 GB 1
85 CA 1
80 ES 1
71 IN 1
48 DE 1
38 FR 1
18 BR, PT 2
16 GR 1
15 NL 1
11 AU 1
10 MX 1
8 IT, PK 2
7 IE, JP, NG 3
6 AR, AT, PL 3
5 BE, PR, SG, TR 4
4 CH, CO, LV, RU, SI, UA 6
3 AE, BO, DK, HR, HU, RO, TH, VN 8
2 AS, CF, CL, CZ, FI, GH, HK, KE, LT, PH, SE, UZ 12
1 AM, BA, BG, CN, CR, CY, DO, DZ, EE, EG, HN, ID, IL, IQ, IR, JE, KW, LU, MA, MD, MK, MT, MY, NZ, RS, SK, TN 27
Summary of Salary Currency
Frequency Value Count
3224 USD 1
236 EUR 1
161 GBP 1
60 INR 1
25 CAD 1
9 AUD 1
6 BRL, SGD 2
5 PLN 1
4 CHF 1
3 DKK, HUF, JPY, TRY 4
2 THB 1
1 CLP, CZK, HKD, ILS, MXN 5
Summary of Company Location
Frequency Value Count
3040 US 1
172 GB 1
87 CA 1
77 ES 1
58 IN 1
56 DE 1
34 FR 1
15 BR 1
14 AU, GR, PT 3
13 NL 1
10 MX 1
7 IE 1
6 AT, JP, SG 3
5 CH, NG, PL, TR 4
4 BE, CO, DK, IT, LV, PK, PR, SI, UA 9
3 AE, AR, AS, CZ, FI, HR, LU, RU, TH 9
2 CF, EE, GH, HU, ID, IL, KE, LT, RO, SE 10
1 AL, AM, BA, BO, BS, CL, CN, CR, DZ, EG, HK, HN, IQ, IR, MA, MD, MK, MT, MY, NZ, PH, SK, VN 23
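The summaries above group distinct values by how often they occur. A minimal sketch of how such a table could be built is shown here. The function name and the toy input are hypothetical; the sketch illustrates the grouping idea only and does not reproduce the code used for the report.

```python
from collections import Counter, defaultdict


def frequency_summary(values):
    """Group distinct values by how often they occur.

    Returns rows of (frequency, values with that frequency, number of such
    values), ordered from the most to the least frequent.
    """
    counts = Counter(values)
    by_frequency = defaultdict(list)
    for value, frequency in counts.items():
        by_frequency[frequency].append(value)
    return [
        (frequency, sorted(group), len(group))
        for frequency, group in sorted(by_frequency.items(), reverse=True)
    ]


# Hypothetical toy column, not the real salary_currency data.
sample = ["USD"] * 4 + ["EUR"] * 2 + ["GBP"] * 2 + ["INR"]
rows = frequency_summary(sample)
for frequency, group, count in rows:
    print(frequency, ", ".join(group), count)
```

Rows with a large Count and a small Frequency correspond to the long tail of rare values that motivates the grouping into regions and job title categories.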

2.6. DATASET OVERVIEW

STRUCTURE OF DSALARIES DATASET:

The variables of the dsalaries dataset are listed below.

Structure of the dsalaries Dataset
Variable Description
work_year Year of the work data
experience_level Level of experience
employment_type Type of employment
job_title Title of the job
salary Salary in local currency
salary_currency Currency of the salary
salary_in_usd Salary in USD
employee_residence Country of residence of the employee
remote_ratio Percentage of work done remotely
company_location Location of the company
company_size Size of the company

2.7. EMPLOYEE_RESIDENCE CATEGORIZED BY REGIONS:

For the sake of insightful analysis, the employee residence variable is categorized by region. Europe and North America account for most records, while the remaining regions have very few occurrences, so the latter were combined into the category “Other Regions”. This category includes Asia, South America, Africa and Oceania.

Region Employee Residence
Europe AL, AD, AT, BA, BE, BG, BY, CH, CY, CZ, DE, DK, EE, ES, FI, FR, GB, GR, HR, HU, IE, IS, IT, JE, LT, LU, LV, MC, MD, ME, MK, MT, NL, NO, PL, PT, RO, RS, RU, SE, SI, SK, TR, UA
Other Regions AE, AM, CN, HK, ID, IL, IN, IQ, IR, JP, KW, MY, PH, PK, SG, TH, UZ, VN, AR, BO, BR, CL, CO, PE, UY, VE, CF, DZ, EG, GH, KE, MA, NG, TN, AS, AU, NZ
North America CA, CR, DO, HN, MX, PR, US
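The grouping above amounts to a lookup from country code to region. The sketch below copies the code lists from the table. The function name and the "Unclassified" fallback for codes missing from the table are assumptions made for illustration.

```python
# Region membership copied from the table above.
REGIONS = {
    "Europe": (
        "AL AD AT BA BE BG BY CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS "
        "IT JE LT LU LV MC MD ME MK MT NL NO PL PT RO RS RU SE SI SK TR UA"
    ).split(),
    "Other Regions": (
        "AE AM CN HK ID IL IN IQ IR JP KW MY PH PK SG TH UZ VN "
        "AR BO BR CL CO PE UY VE CF DZ EG GH KE MA NG TN AS AU NZ"
    ).split(),
    "North America": "CA CR DO HN MX PR US".split(),
}

# Invert the table into a direct code -> region lookup.
CODE_TO_REGION = {
    code: region for region, codes in REGIONS.items() for code in codes
}


def region_of(code, default="Unclassified"):
    """Return the region for a two-letter country code.

    The default for unlisted codes is an assumption, not part of the report.
    """
    return CODE_TO_REGION.get(code.upper(), default)


print(region_of("US"), region_of("GB"), region_of("IN"))
```

Keeping the mapping in one dictionary makes it easy to check that no country code is assigned to two regions.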

2.8. JOB TITLE CATEGORIES

Because of the high number of different job titles, and for ease of further analysis, the job titles were categorized into the following groups. The criteria taken into account were: relevance to core skills and responsibilities, industry-standard role definitions, overlap with related fields, hierarchical and management aspects, and specialization or unique focus. The category “Emerging Technologies & Specialized Roles” is important because it contains many job titles that occur infrequently. Because of their degree of specialization, it is crucial to include them in the analysis, especially to assess whether niche roles can command high salaries.

Category Job Titles
Data Engineering & Architecture Data Engineer, Data Architect, Big Data Engineer, Data Infrastructure Engineer, Data Operations Engineer, AI Developer, Director of Data Science, Cloud Database Engineer, Lead Data Engineer, Cloud Data Engineer, Principal Data Engineer, Software Data Engineer
Data Science & Analytics Data Scientist, Data Analyst, Applied Scientist, Applied Data Scientist, Data Science Manager, Data Science Engineer, Data Manager, Data Science Consultant, Data Analytics Manager, BI Data Analyst, Business Data Analyst, Data Specialist, BI Developer, BI Analyst, Head of Data Science, Head of Data, Lead Data Scientist, Data Science Lead, Principal Data Scientist, Data Quality Analyst, NLP Engineer, Lead Data Analyst, Product Data Analyst, Data Operations Analyst, Cloud Data Engineer, Financial Data Analyst, Lead Machine Learning Engineer, Machine Learning Manager, Data Analytics Consultant, Data Analytics Lead, Data Analytics Specialist, Data Analytics Engineer, Data Lead, Data Modeler, Data Scientist Lead, Data Strategist, ETL Engineer, ETL Developer, Insight Analyst, Marketing Data Analyst, Principal Data Analyst
Machine Learning & Advanced Research Machine Learning Developer, Applied Machine Learning Engineer, Machine Learning Engineer, Analytics Engineer, Research Scientist, Research Engineer, ML Engineer, Machine Learning Scientist, Machine Learning Software Engineer, Machine Learning Research Engineer, Applied Machine Learning Scientist, Big Data Engineer, Director of Data Science, Machine Learning Infrastructure Engineer, Machine Learning Researcher
Emerging Technologies & Specialized Roles AI Developer, AI Scientist, AI Programmer, Applied Scientist, Data Science Manager, Deep Learning Engineer, Machine Learning Researcher, 3D Computer Vision Researcher, Business Intelligence Engineer, Azure Data Engineer, BI Data Engineer, Cloud Data Architect, Compliance Data Analyst, Data DevOps Engineer, Data Management Specialist, Data Science Tech Lead, Deep Learning Researcher, Finance Data Analyst, Head of Machine Learning, Manager Data Management, Marketing Data Engineer, Power BI Developer, Principal Data Architect, Principal Machine Learning Engineer, Product Data Scientist, Staff Data Analyst, Staff Data Scientist, Autonomous Vehicle Technician, Big Data Architect, Data Lead, Data Modeler, Data Strategist, ETL Engineer, Insight Analyst, Marketing Data Analyst, Principal Data Analyst, Computer Vision Engineer, Computer Vision Software Engineer, MLOps Engineer

The barplot below illustrates the frequency of the job title categories. The category ‘Emerging Technologies & Specialized Roles’ was created to examine whether the most unique, innovative and specialized roles in the data science field can command high salaries.

3. EXPLORATORY DATA ANALYSIS (EDA)

3.1. DESCRIPTIVE STATISTICS

3.1.1. SUMMARY OF THE DATASET:

The summary of the dsalaries dataset reveals some important observations, for instance a wide range in salary figures, diversity in remote work arrangements and high maximum salaries.

Some of these observations will be further explored in the next parts of this report.

Numeric variables
Variable        Min.    1st Qu.   Median    Mean      3rd Qu.   Max.
work_year       2020    2022      2022      2022      2023      2023
salary          6000    100000    138000    190696    180000    30400000
salary_in_usd   5132    95000     135000    137570    175000    450000
remote_ratio    0.00    0.00      0.00      46.27     100.00    100.00

Character variables (length 3755, class character, mode character): experience_level, employment_type, job_title, salary_currency, employee_residence, company_location, company_size, region, job_title_category

3.1.2. SALARY IN USD DISTRIBUTION AND DENSITY:

The histogram with the density curve of salaries in USD shows clearly that most salaries fall roughly between 100,000 USD and 200,000 USD. The peak of the density plot aligns with the salary range where the highest number of data points is found.

The summary of the dsalaries dataset reveals that the lowest salary is 5,132 USD and the highest is 450,000 USD. The way the density plot deviates from the center (median) can indicate skewness in the salary data. For instance, a long tail on the right of the density plot would suggest that a smaller number of individuals have salaries significantly higher than the median, which is consistent with the wide range observed.
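A rough numerical check of the direction of skew can be made from the summary statistics alone. The sketch below uses Pearson's second skewness coefficient, 3 × (mean − median) / standard deviation, with the rounded values reported in this document. This coefficient is a swapped-in, approximate indicator and not the moment-based skewness, and it was not part of the original analysis.

```python
# Rounded summary statistics of salary_in_usd as reported in this document.
MEAN_USD = 137_570
MEDIAN_USD = 135_000
SD_USD = 63_055.63


def pearson_second_skewness(mean, median, sd):
    """Pearson's second skewness coefficient: 3 * (mean - median) / sd."""
    return 3 * (mean - median) / sd


skew = pearson_second_skewness(MEAN_USD, MEDIAN_USD, SD_USD)
print(f"approximate skewness indicator: {skew:.3f}")
```

A positive value means the mean lies above the median, which points to a right tail. A value close to zero would indicate that the skew is mild despite the presence of high outliers.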

3.2. VARIABLE RELATIONSHIPS

3.2.1. CORRELATION MATRIX HEATMAP

The correlation heatmap shows that remote ratio, salary (in the local currency) and work year each have a seemingly negligible correlation with salary in USD.

3.2.2. VIOLIN PLOT FOR SALARY IN USD VS. EXPERIENCE LEVEL

This interactive violin plot for salary in USD vs. experience level enables examination of all statistics for the different categories of experience level. Some general interpretations can be drawn:

  • Larger Company Compensation: The presence of higher and more varying salaries in larger companies could reflect the broader scope of roles, availability of resources, and the ability to pay for highly specialized skills.
  • Smaller Company Dynamics: Smaller companies showing a narrower range of salaries could be due to a number of factors including less role differentiation, budget constraints, or a more unified salary structure.
  • Overall Salary Trends: The fact that outliers are present across all company sizes indicates that exceptionally high salaries are not exclusive to any particular company size and could be influenced more by individual role, skill level, or negotiation.

3.2.3. BOXPLOT FOR SALARY IN USD BY COMPANY SIZE

Analysis of the Boxplot:

  • Variability in Salary: There’s a noticeable difference in the interquartile range (IQR) – the box part of the boxplot – among the three company sizes. Large companies have a wider IQR compared to medium and small companies, suggesting more variability in the salaries offered by larger companies.

  • Median Salary Comparison: The median salary – indicated by the line in the middle of each box – appears to be highest in large companies, followed by medium and then small companies. This trend is a common observation in the industry, as larger companies often have more resources to offer competitive salaries.

  • Outliers: There are numerous outliers for large and medium companies, represented by the individual dots outside the upper whiskers of the boxplot. This suggests that within these company sizes, there are positions that command exceptionally high salaries, possibly due to specialized skills, senior roles, or other factors.

  • Lower Salary Range: Small companies show a compact box with fewer outliers, which could indicate a more uniform salary structure with less deviation from the median salary.

3.2.4. SALARY IN USD DISTRIBUTION IN REGIONS

  1. General Observations:
  • North America: The box representing North America shows a higher median salary compared to Europe and Other Regions. The range of salaries is also quite broad, with a number of outliers indicating extremely high salaries.
  • Europe: Europe’s median salary is lower than that of North America, and the interquartile range (IQR) is more compact, suggesting less variability in salaries than in North America.
  • Other Regions: The Other Regions category has the lowest median salary and a smaller IQR, indicating more consistency in salaries, albeit at a lower range compared to Europe and North America.
  2. Outliers:
  • There is a significant number of outliers in the North American region, indicating that salaries can reach very high levels, possibly due to the presence of tech hubs like Silicon Valley where top-tier salaries are common.
  • Europe and Other Regions also show outliers, but these are fewer and less extreme compared to North America.
  3. Implications for the Report:
  • The plot supports the statement that the highest salaries in the data science field are more likely to be found in North America, which could be influenced by the higher cost of living, the concentration of multinational corporations, and a mature tech industry.
  • The data also suggests that while Europe and Other Regions may offer competitive salaries, the most lucrative opportunities, in terms of salary potential, are more prevalent in North America.

3.2.5. AVERAGE SALARY IN USD FOR EMPLOYEE RESIDENCE IN REGIONS

This scatter plot visualizes the average salaries for data science roles based on the employee’s residence within each region: Europe, North America, and Other Regions. Here’s a succinct analysis:

  • Europe: Shows a cluster of average salaries with a tight range, suggesting less variability in pay across different countries within the region.
  • North America: Displays higher average salaries than Europe, with a spread indicating that some residences in North America have significantly higher average salaries than others.
  • Other Regions: There’s a wide spread in average salaries, with some residences showing comparable averages to North America, potentially indicating the presence of high-paying countries outside the traditional economic centers.

The plot underscores the regional disparities in average data science salaries and suggests that residence within these regions can be a strong indicator of salary expectations.

3.2.6. SALARY IN USD DISTRIBUTION BY JOB TITLE GROUP

Overall, the plot shows that more specialized and advanced job title groups tend to have a higher variation in salaries, with the potential for significantly higher pay. Here’s a succinct analysis:

  • Data Engineering & Architecture: Shows a moderate median salary with a relatively compact interquartile range (IQR), indicating consistency in salaries within this group.
  • Data Science & Analytics: This group has a similar median salary to Data Engineering & Architecture but a slightly wider IQR, suggesting more variation in pay.
  • Emerging Technologies & Specialized Roles: This category exhibits a wider IQR and higher median salary, which could reflect the high demand and compensation for specialized skills.
  • Machine Learning & Advanced Research: Has the widest IQR, indicating a significant spread in salaries, with some very high outliers, reflecting the premium paid for advanced ML expertise and research roles.

3.2.7. AVERAGE SALARY FOR JOB TITLES IN THEIR RESPECTIVE CATEGORIES

The scatter plot displays the average salaries for various job titles within each job category in the data science field. Here’s a succinct analysis:

  • Data Engineering & Architecture: This category shows a cluster of job titles with average salaries mostly in the lower to middle salary range, suggesting that while important, these roles may not command the highest salaries.
  • Data Science & Analytics: There’s a broad distribution of average salaries, indicating variability in compensation which may reflect a range of specializations and responsibilities within this category.
  • Emerging Technologies & Specialized Roles: The average salaries are dispersed across a wide range, with several job titles commanding higher average salaries, highlighting the value of niche skills in the market.
  • Machine Learning & Advanced Research: This category shows a concentration of higher average salaries, underscoring the industry’s demand for advanced technical skills and research capabilities.

4. CLUSTERING OF THE DATASET DSALARIES

4.1. GENERAL CLUSTERING

The variables salary_in_usd, work_year, and remote_ratio were chosen for clustering. These variables were scaled to ensure that they contribute equally to the clustering process.
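Scaling matters here because salary_in_usd ranges over hundreds of thousands while remote_ratio ranges from 0 to 100, so unscaled distances would be dominated by salary. The sketch below assumes z-score standardization (subtract the mean, divide by the sample standard deviation). The report does not state the exact scaling method, and the input values are hypothetical.

```python
from statistics import mean, stdev


def standardize(values):
    """Z-score standardization: (x - mean) / sample standard deviation."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 in the denominator)
    return [(v - m) / s for v in values]


# Hypothetical example values, not taken from the dataset.
salaries = [60_000, 95_000, 135_000, 175_000, 260_000]
remote = [0, 0, 50, 100, 100]

z_salaries = standardize(salaries)
z_remote = standardize(remote)
print([round(z, 2) for z in z_salaries])
print([round(z, 2) for z in z_remote])
```

After standardization each variable has mean 0 and standard deviation 1, so a one-unit difference in any variable carries the same weight in the Euclidean distance used by k-means.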

4.1.1. ELBOW METHOD FOR CHOOSING OPTIMAL K

Analysis of the Elbow Method Plot:

  • Decreasing WSS: Initially, as k increases, there is a steep decline in WSS, indicating significant gains from increasing the number of clusters.
  • Elbow Point: The “elbow” of the plot appears to be at k = 4, where the rate of decrease sharply diminishes. This inflection point suggests that adding more clusters beyond this number results in diminishing returns in terms of WSS reduction.
  • Optimal Clusters: Based on this plot, k = 4 is identified as the optimal number of clusters for the data because it represents a balance between minimizing WSS and avoiding overfitting with too many clusters.
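The quantity plotted in an elbow chart is the total within-cluster sum of squares (WSS): the squared distance of every point to the mean of its cluster, summed over all clusters. The sketch below computes WSS for hand-picked cluster assignments on a hypothetical toy dataset. It does not run k-means and is meant only to show why WSS drops sharply up to the natural number of groups and only slightly afterwards.

```python
def wss(points, labels):
    """Total within-cluster sum of squared distances to each cluster mean."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    total = 0.0
    for members in clusters.values():
        dims = len(members[0])
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dims)]
        total += sum((p[d] - centroid[d]) ** 2 for p in members for d in range(dims))
    return total


# Hypothetical one-dimensional data with two obvious groups.
points = [(1.0,), (2.0,), (3.0,), (10.0,), (11.0,), (12.0,)]

wss_k1 = wss(points, [0, 0, 0, 0, 0, 0])  # everything in one cluster
wss_k2 = wss(points, [0, 0, 0, 1, 1, 1])  # the two natural groups
wss_k3 = wss(points, [0, 0, 1, 2, 2, 2])  # one group split further
print(wss_k1, wss_k2, wss_k3)
```

In this toy example the drop from one to two clusters is large and the drop from two to three is small, so the elbow is at k = 2. The same reading of the curve leads to k = 4 in the report.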

4.1.2. K-MEANS CLUSTERING

The k-means clustering algorithm was run with k set to 4. Clusters were defined based on similarities in salary, work year, and remote work ratio.

Cluster Counts
Cluster   Count
1         1298
2         631
3         1207
4         619

Cluster Centers
Cluster   Salary (Scaled)   Work Year (Scaled)   Remote Ratio (Scaled)
1         -0.3885587        0.2986313            -0.9301078
2         1.2550413         0.4566456            -0.8919661
3         0.2802829         0.1498031            1.0963928
4         -1.0111200        -1.3838113           0.7217518

The values above indicate the following:

Cluster 1: represents employees with below-average salaries, relatively recent records, and infrequent remote work. The negative values in Salary (Scaled) and Remote Ratio (Scaled) indicate lower salaries and infrequent remote work, while the positive value in Work Year (Scaled) indicates records from more recent years.

Cluster 2: represents highly paid employees with recent records and a low remote work ratio. The high positive value in Salary (Scaled) indicates higher salaries, and the positive value in Work Year (Scaled) indicates records from the most recent years. The negative value in Remote Ratio (Scaled) suggests a low tendency for remote work, possibly indicating in-office high-level professionals.

Cluster 3: represents employees with slightly above-average salaries, records close to the average work year, and frequent remote work. The small positive value in Salary (Scaled) indicates salaries somewhat above the mean, and the high positive value in Remote Ratio (Scaled) indicates a high tendency for remote work, possibly indicating remote workers or freelancers.

Cluster 4: represents the lowest-paid employees, with the oldest records and a higher tendency for remote work. The strongly negative values in both Salary (Scaled) and Work Year (Scaled) indicate lower salaries and earlier years. The positive value in Remote Ratio (Scaled) suggests a higher tendency for remote work, possibly indicating entry-level or intern positions that offer remote work options.

4.1.3. PCA

The scatter plot visualizes the first two principal components obtained from a PCA (Principal Component Analysis). The points are colored according to the four clusters identified by the k-means algorithm.

  • Distinct Clusters: The PCA plot shows that the four clusters are largely distinct, as they are spread out across the first two principal components, which are the dimensions capturing the most variance.
  • Cluster Overlap: There appears to be some overlap between the clusters, particularly between clusters 1 and 2. This suggests some similarity between these groups in the multidimensional space of the original variables.
  • PCA Effectiveness: The separation of most clusters along PC1 and PC2 indicates that PCA is effective in reducing dimensionality while still preserving the structure necessary for cluster differentiation.

4.1.4. SILHOUETTE PLOT

Mean Values by Cluster
cluster   work_year   salary     salary_in_usd   remote_ratio
1         2022.580    125331.6   113069.58       1.078582
2         2022.689    218123.8   216707.80       2.931854
3         2022.477    161147.3   155243.80       99.544325
4         2021.417    357416.1   73813.58        81.340872

The following conclusions can be drawn from the table above:

Work_Year: Cluster 1 has an average work year of approximately 2022.58, suggesting that its records come mostly from 2022 and 2023. Cluster 2 is slightly later on average (2022.689). Cluster 3 is similar to Cluster 1 (2022.477). Cluster 4 has an average of 2021.417, indicating that this cluster contains older data.

Salary: Clusters 1, 2, and 3 have average salaries of approximately 125,332, 218,124, and 161,147 respectively. These figures are averages of the salary in local currency and do not account for currency differences. Cluster 4 has a significantly higher average salary of around 357,416.

Salary_in_USD: This column expresses salaries in US dollars, facilitating direct comparison across clusters. Clusters 1, 2, and 3 have average salaries in USD of around 113,070, 216,708, and 155,244 respectively. Cluster 4, despite having the highest average nominal salary, has the lowest average in USD (73,814), suggesting that this cluster contains salaries paid in currencies with high nominal amounts but lower value in USD.

Remote_Ratio: For Cluster 1, the average remote ratio is around 1.08%, suggesting very low remote work prevalence. Cluster 2 has a slightly higher average of around 2.93%, indicating a marginal increase in remote work. Cluster 3 shows a significant jump, with an average of 99.54%, suggesting almost entirely remote work. Cluster 4 also indicates high remote work prevalence (81.34%), although not as high as Cluster 3.
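The scaled salary values in the Cluster Centers table and the salary_in_usd means in the Mean Values by Cluster table describe the same quantities on different scales. The sketch below converts the scaled centers back to USD. It assumes that the scaling was a z-score based on the overall mean (137,570 USD, rounded) and standard deviation (63,055.63 USD) reported in this document; the function name is illustrative.

```python
# Overall statistics of salary_in_usd as reported in this document (rounded).
MEAN_USD = 137_570
SD_USD = 63_055.63

# Salary (Scaled) column of the Cluster Centers table.
scaled_centers = {1: -0.3885587, 2: 1.2550413, 3: 0.2802829, 4: -1.0111200}

# salary_in_usd column of the Mean Values by Cluster table.
reported_means = {1: 113069.58, 2: 216707.80, 3: 155243.80, 4: 73813.58}


def unscale(z, mean=MEAN_USD, sd=SD_USD):
    """Invert a z-score: x = mean + z * sd (assumed scaling method)."""
    return mean + z * sd


recovered = {cluster: unscale(z) for cluster, z in scaled_centers.items()}
for cluster in sorted(recovered):
    print(cluster, round(recovered[cluster], 2), reported_means[cluster])
```

The recovered values agree with the reported cluster means to within rounding of the inputs, which supports reading a scaled center of about 1.26 as roughly 1.26 standard deviations above the average salary.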

4.2. CLUSTERING BASED ON SALARY IN USD, JOB TITLE AND EMPLOYEE RESIDENCE

4.2.1. DATA PREPARATION:

4.2.2. ELBOW METHOD FOR CHOOSING OPTIMAL K

Based on this plot, one might choose k=4 for clustering as it appears to be the point after which the reductions in WSS become less significant, indicating that additional clusters do not contribute much to explaining the variance.

4.2.3. K-MEANS CLUSTERING

Overall, the clustering suggests that salaries in the data science field are influenced by job title and geographic location, with significant variance between different clusters. The larger clusters likely represent more common salary ranges and roles, while the smaller clusters may reflect specialized or regional characteristics of the data science job market.

Cluster 4: This is the largest cluster with 3006 individuals, indicating it may represent the most common salary range and job characteristics within the dataset. The average salary is relatively high at approximately $153,005, with a median close to $145,000, suggesting a strong central concentration around this salary level. The wide salary range indicates significant diversity within this group. The most common job title is “Data Engineer,” and the most common residence is the United States, which could imply that data engineering is a lucrative and common role in the US data science job market.

Cluster 3: Comprising 733 individuals, this cluster has a lower average salary of around $76,055 and a median of $65,000, which might reflect early to mid-career positions. The salary range is also wide, potentially indicating a variety of job roles within this cluster. The prevalent job title is “Data Scientist,” and the top residence is Great Britain, suggesting that data science roles in GB are diverse and possibly include a range from junior to senior positions.

Cluster 1: This is a very small cluster with only 14 individuals, which could represent a niche or specialized segment within the data science market. The average salary is around $60,800, with a narrower salary range. The primary job title is “Cloud Data Engineer,” and the top residence is Argentina, indicating a specific market or demand for cloud engineering expertise in that region.

Cluster 2: The smallest cluster, with just 2 individuals, has an average and median salary of $22,500, which is substantially lower than the other clusters. This may indicate entry-level positions or roles in regions with lower salary scales. The job title “Compliance Data Analyst” and the residence in Nigeria suggest these might be specialized roles in a particular sector or locale.

Summary of Clusters
cluster Count Average_Salary Median_Salary Salary_Range Top_Job_Title Top_Residence
4 3006 153004.70 145000 426000 Data Engineer US
3 733 76055.23 65000 294868 Data Scientist GB
1 14 60800.57 55000 148000 Cloud Data Engineer AR
2 2 22500.00 22500 15000 Compliance Data Analyst NG

4.2.4. PCA

Here’s an interpretation of the PCA plot:

  • Variance Explained: Both PC1 and PC2 explain a very small amount of the variance (1% each). This suggests that these two components do not capture the majority of the information in the dataset. The low variance explained by the first two principal components may indicate that the dataset is high-dimensional or that the variability is spread out over many variables.

  • Cluster Distribution: Despite the low variance explained, the clusters appear to be differentiated along the PC1 axis, though there is considerable overlap along the PC2 axis. This could mean that the feature or combination of features that most strongly define the clusters are captured by PC1.

  • Cluster Overlap: The significant overlap of clusters, especially along PC2, suggests that the clusters are not entirely distinct in the first two principal component dimensions. This might imply that the clusters are not well-separated in the higher-dimensional space or that more components are needed to achieve clear separation.

4.2.5. SILHOUETTE ANALYSIS

The silhouette plot qualifies the clustering result for the data science salaries dataset. The elbow method suggested four clusters as optimal. The silhouette analysis, by contrast, showed mixed results: one cluster exhibits high silhouette scores, indicating that its members fit it well, while another shows markedly negative values, indicating that many of its members lie closer on average to a neighbouring cluster. This discrepancy implies that:

  1. The k-means assumption of spherical clusters may not hold for this data.
  2. The actual structure of the data might be more complex than k-means can capture.
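
For reference, the silhouette score of a point is s = (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is its mean distance to the members of the nearest other cluster. The sketch below is a small NumPy illustration of that definition on hypothetical toy points, not the code used for this report. The helper name `silhouette_scores` is made up for this example; scikit-learn offers an equivalent in `sklearn.metrics.silhouette_samples`.

```python
import numpy as np


def silhouette_scores(X, labels):
    """Per-point silhouette coefficient s = (b - a) / max(a, b)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Full pairwise Euclidean distance matrix (fine for small inputs).
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # singleton cluster: score is left at 0
        # Mean distance to own cluster, excluding the point itself (dist[i, i] is 0).
        a = dist[i, own].sum() / (n_own - 1)
        # Smallest mean distance to any other cluster.
        b = min(dist[i, labels == k].mean()
                for k in np.unique(labels) if k != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores


# Hypothetical toy points: two tight groups, with the last point placed
# next to group 0 but deliberately labelled as group 1.
X = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10], [0.5, 0.5]]
labels = [0, 0, 0, 1, 1, 1, 1]

scores = silhouette_scores(X, labels)
print(scores.round(3))
```

The mislabelled point receives a negative score because it is far closer to the other group than to its own, which is the pattern the negative silhouette values above point to.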

5. ADVANCED ANALYSIS

5.1. DECISION TREES

The decision trees below cover each region: Europe, North America and Other Regions. They split on experience level and job title category. Salaries are expressed in units of 10,000 USD.

5.2. COLORS REPRESENTING CORRESPONDING QUARTILES AND SALARY IN USD RANGES

The plot below shows which quartile and which salary range in USD each color represents.
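
Assigning each salary to a quartile can be sketched as follows. This is an illustration using the Python standard library with hypothetical salary values, not the code used for this report. The cut points depend on the quantile method chosen, so the boundaries used for the plot may differ slightly.

```python
from bisect import bisect_left
from statistics import quantiles


def quartile_labels(salaries):
    """Return a quartile label (Q1..Q4) per salary, plus the three cut points."""
    cuts = quantiles(salaries, n=4, method="inclusive")  # Q1, median, Q3
    # bisect_left places a salary equal to a cut point in the lower quartile.
    labels = [f"Q{bisect_left(cuts, s) + 1}" for s in salaries]
    return labels, cuts


# Hypothetical salaries in USD, for illustration only.
salaries = [50000, 80000, 95000, 120000, 145000, 175000, 200000, 260000]

labels, cuts = quartile_labels(salaries)
for salary, label in zip(salaries, labels):
    print(salary, label)
```

Salaries labelled Q4 form the top quartile that the research question refers to.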

5.3. EXPLANATION OF ABBREVIATIONS OF JOB TITLE CATEGORIES

The table below explains the abbreviations used in the decision trees.

Legend for Job Title Category Abbreviations
Job Title Category                          Abbreviation
Data Engineering & Architecture             DEA
Data Science & Analytics                    DSA
Emerging Technologies & Specialized Roles   ETSR
Machine Learning & Advanced Research        MLAR

5.4. DECISION TREES FOR REGIONS: EUROPE, NORTH AMERICA AND OTHER REGIONS

6. CONCLUSION

This report aimed to answer the research question: “What factors contribute to the top quartile of data science salaries?” Based on the analysis conducted, several key factors influencing salaries in the data science industry were identified.

6.1. KEY FINDINGS

  1. Experience Level and Job Title Impact: Higher salaries are commonly associated with senior-level positions and specialized job titles such as Machine Learning & Advanced Research roles. The decision trees and clustering analysis highlighted that experience and job title categories significantly influence salary levels.

  2. Geographical Variations: The salary distribution varied significantly across regions. North America generally offered higher salaries compared to Europe and other regions. This trend was evident in both the boxplots and the average salary trends over the years.

  3. Company Size: Larger companies tended to offer higher and more variable salaries. This was visible in the boxplot analysis, where larger companies had a wider interquartile range, indicating diverse compensation strategies.

  4. Emerging Technologies and Specializations: Job titles within the “Emerging Technologies & Specialized Roles” category often correlated with higher salaries. This suggests that niche skills and innovative roles are highly valued in the industry.

  5. Remote Work Flexibility: The analysis showed varying impacts of remote work on salaries. While some high-paying roles offered flexibility, there was no consistent trend indicating a direct correlation between remote work and higher salaries.

6.2. IMPLICATIONS

  1. Career Path and Skill Development: For professionals in the data science field, focusing on gaining experience, enhancing specialized skills, and aiming for senior roles can be beneficial for salary progression.
  2. Geographic Considerations: Professionals might consider opportunities in regions like North America for potentially higher salaries, although this should be weighed against cost of living and personal circumstances.
  3. Company Choices: Working in larger companies might offer opportunities for higher salaries, but it’s essential to consider other factors like company culture, growth opportunities, and job security.
  4. Skill Specialization: Staying abreast of emerging technologies and developing niche skills can lead to lucrative opportunities, as seen in the higher salaries for specialized roles.

6.3. LIMITATIONS AND RECOMMENDATIONS FOR FURTHER RESEARCH

  1. Data Scope: The dataset primarily focused on certain regions and job titles. Expanding the data to include more diverse regions and emerging job roles would provide a more comprehensive understanding.
  2. Dynamic Industry Trends: The data science field is rapidly evolving. Continuous analysis with up-to-date data is necessary to understand current salary trends.
  3. Factor Interactions: Further research could explore the interactions between different factors, like how company size and experience level jointly influence salaries.

In conclusion, while the data science field offers diverse and lucrative career opportunities, factors such as experience level, job title, geographic location, and company size play crucial roles in determining salary levels. Continuous learning and adaptation to industry trends are key for professionals aiming to reach the top quartile of data science salaries.
